By your favourite data enthusiast,
Surohit Tandon
So, what is this project about? Well, honestly, it's just me having some fun with Airbnb's data. Before I even begin to think about what questions I want to pose, I am going to dive into the data a bit to see what questions naturally flow out of it. Below you will see a little bit of EDA. Never fear, I will comment every code block so that we can take this journey together!!!
To get into the mood, click here.
Also, a link to the dataset.
# Importing my basic libraries for the EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
# Libraries I imported post EDA for other sections
from scipy.stats import pearsonr
from scipy.stats import iqr
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score, accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score, f1_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Imported for visualising the decision tree
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
# Reading in our beautiful data using Pandas.read_csv
calendar_df = pd.read_csv('seattle_data/calendar.csv')
listing_df = pd.read_csv('seattle_data/listings.csv')
review_df = pd.read_csv('seattle_data/reviews.csv')
# Ensuring all the data was read in correctly (This comment refers to the next two code blocks as well)
calendar_df.head()
listing_df.head()
review_df.head()
Hmm, so it seems all the data was read in correctly. But it's kind of hard to see what's going on in the listing dataframe, so let's dive a little deeper.
# Having a look at all the columns of the listing dataframe
list(listing_df.columns)
# Looking at the data types that are present
listing_df.dtypes
# Looking at the object-type columns we may want to convert later
list(listing_df.columns[listing_df.dtypes == object])
listing_df.shape
# Looking at the datatypes for each dataframe's columns
calendar_df.dtypes
review_df.dtypes
listing_df.dtypes
# Looking at the number of columns
print("The number of columns/features present in listing_df is {}".format(listing_df.shape[1]))
if listing_df.shape[1] > 40:
    print("Jeez, that's a lot of columns...")
else:
    print("Hmm, not too bad")
So, that's a bucketload of features all packed into one dataset. What immediately caught my semi-trained eye were the pricing and review scores. This immediately started to tingle the ML side of my brain and I thought "ooooooo, predictive algorithms heheh", but it might be wise to take a step back and begin by really evaluating all of the data together to see what the best questions are that we could ask. ML ain't always the answer. Unfortunately.
Below we are going to have a look at how much missing data we have. I'm gonna start with the smaller datasets, since they will most likely be easier to build visualisations for and get a general sense of. HERE WE GOOOOOOO
# Looking at nulls in the review table
review_df.isnull().sum()
# Looking at the most recent date for our review dataframe
review_df.date.max()
# Reviewing the actual missing comments
review_df[review_df.comments.isnull()]
# Seeing what percentage of the original dataframe these missing values represent
print("The percentage of the original dataframe these values make up is {0:.3f}%".format(100*review_df.comments.isnull().sum()/review_df.shape[0]))
if 100*review_df.comments.isnull().sum()/review_df.shape[0] < 0.05:
    print("Oh, this is a relatively insignificant proportion of the data.")
else:
    print("Oh damn, that's a big chunk")
# Reviewing the nulls for the calendar dataframe
calendar_df.isnull().sum()
# Evaluating the dataframes null price rows
calendar_df[calendar_df.price.isnull()].tail()
# Looking at the proportion of missing data with respect to the original dataframe
print("The percentage of the original dataframe these values make up is {0:1.0f}%".format(100*calendar_df.price.isnull().sum()/calendar_df.shape[0]))
if 100*calendar_df.price.isnull().sum()/calendar_df.shape[0] < 0.05:
    print("Oh, this is a relatively insignificant proportion of the data.")
else:
    print("Oh damn, that's a big chunk")
# Looking at the total size of the dataframe for future reference
print("Our dataframe consists of {} rows".format(calendar_df.shape[0]))
print("If we dropped the missing values we would have {} rows left over".format(calendar_df.shape[0]-calendar_df.price.isnull().sum()))
Now, let's consider our missing data in the calendar dataframe. There is definitely a SIGNIFICANT chunk of missing data. Let's consider the operative word, SIGNIFICANT. What does that mean? Well, it means it's a pain in the butt to deal with. We could ignorantly drop the rows (or worse, the column itself), but since the column is valuable and the rows with missing data make up a large chunk of it, we need to consider imputing some data. We have several options that I know of for this:
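To make those options concrete, here is a minimal sketch of the imputation strategies I usually weigh up, shown on a toy calendar-style frame (the data here is made up, not our actual calendar table): a global mean fill, a global median fill, and a group-wise fill that uses each listing's own price history.

```python
import pandas as pd
import numpy as np

# Toy stand-in for a calendar-style table: prices missing for some rows
toy = pd.DataFrame({
    'listing_id': [1, 1, 2, 2, 3, 3],
    'price': [100.0, np.nan, 80.0, 82.0, np.nan, 60.0],
})

# Option 1: global mean fill (simple, but flattens per-listing differences)
global_fill = toy['price'].fillna(toy['price'].mean())

# Option 2: global median fill (more robust to outliers)
median_fill = toy['price'].fillna(toy['price'].median())

# Option 3: group-wise mean fill (each listing's gap filled from its own history)
group_fill = toy.groupby('listing_id')['price'].transform(lambda s: s.fillna(s.mean()))
```

The group-wise version is usually the most faithful for panel-style data like a calendar of listings, but it only works when every listing has at least one observed price.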
Let's move on for now and look at the missing data in the listing dataframe.
# Looking at columns with missing data
listing_df.isnull().sum()
# Viewing the distribution of missing data
plt.hist(listing_df.isna().sum(axis=1),color='g')
plt.axvline(x=np.median(listing_df.isna().sum(axis=1)),color='r',label="median")
plt.legend()
title=("Histogram of listing's dataframe missing data with respect to rows")
plt.ylabel("Count")
plt.xlabel("Number of NaN values")
plt.title(title);
# Shape of our dataset
listing_df.shape
It seems there isn't that much missing data, but honestly, the above means nothing to me. I think I want to see which columns contain missing data and then evaluate whether I even want any of those columns or if I am happy to delete them from existence. So let's try that approach.
# Determining the number of null columns
(listing_df.isnull().sum() > 0).sum()
# Viewing all the columns with missing data
list(listing_df.columns[listing_df.isnull().any()])
# Finding the number of numerical only column types
int_typ = (listing_df.dtypes == "int64").sum()
float_typ = (listing_df.dtypes == "float64").sum()
print("So, the overall dataset has {} columns. Since I know I won't be conducting any sentiment or text analysis".format(listing_df.shape[1]))
print ("I feel confident that we can drop our non-numerical data types")
print ("This would leave us with {} columns.".format(int_typ+float_typ))
# Checking the above list is the right size
if (listing_df.isnull().sum() > 0).sum() == len(list(listing_df.columns[listing_df.isnull().any()])):
    print("Nice, we have the right list.")
else:
    print("Whoops")
# Displaying which columns would be left for analysis
numbered_columns = list(listing_df.columns[(listing_df.dtypes == 'int64')]) + list(listing_df.columns[(listing_df.dtypes == 'float64')])
numbered_columns
# Finding the % missing for our new dataframe
listing_new = listing_df[numbered_columns].copy()
listing_new.isnull().sum()/listing_new.shape[0]
# Viewing the distribution of missing data
plt.hist(listing_new.isna().sum(axis=1),color='g')
plt.axvline(x=np.median(listing_new.isna().sum(axis=1)),color='r',label="median")
plt.legend()
title=("Histogram of listing's dataframe missing data with respect to rows")
plt.ylabel("Count")
plt.xlabel("Number of NaN values")
plt.title(title);
So what can we take away from our little dive into the data?
Okay, so. Questions. This is arguably the hardest part of this little project, but here goes.
Let's condense what we need to clean below:
As we all know, data analysis can be a very iterative process. These are all of my initial considerations. There may be more that arise, in which case I will append those points below.
# Looking at the list of columns that had a lot of missing data
list(listing_new.columns[listing_new.isnull().sum()/listing_new.shape[0]>0.3])
# Compiling a list of columns we want to drop
dropping = list(listing_new.columns[listing_new.isnull().sum()/listing_new.shape[0]>0.3])
# Removing columns in listing dataframe
listing_new.drop(dropping, axis=1, inplace=True)
# Ensuring our new dataframe has 2 fewer columns, since we wanted to drop 2 columns
if listing_new.shape[1] == 30-2:
    print("We have the correct shape!")
# RE-viewing the distribution of missing data
plt.hist(listing_new.isna().sum(axis=1),color='g')
plt.axvline(x=np.median(listing_new.isna().sum(axis=1)),color='r',label="median")
plt.legend()
title=("Histogram of listing's dataframe missing data with respect to rows")
plt.ylabel("Count")
plt.xlabel("Number of NaN values")
plt.title(title);
# Finding all the columns that need imputing
impute_list = list(listing_new.columns[listing_new.isna().sum()/listing_new.shape[0]>0])
impute_list
# Exploring some averages
listing_new.describe()
# Imputing the means for each feature
for im in impute_list:
    listing_new[im].fillna(listing_new[im].mean(), inplace=True)
# Checking we don't have any nulls
if listing_new.isnull().any().sum() == 0:
    print("We have filled our dataset!")
# Dropping our null comment rows
review_new = review_df[review_df.comments.notnull()].copy()
# Converting the date to datetime
review_new['date'] = pd.to_datetime(review_new['date'], infer_datetime_format= True)
# Checking conversion
review_new.dtypes
# Checking we removed the rows
review_new.isna().sum()
# Looking at all the values in our dataframe
calendar_df.price.value_counts(dropna=False)
# Creating a dummy dataframe
dummy_data = pd.get_dummies(calendar_df['price'], dummy_na = True)
dummy_data.head()
# Dropping all the columns except for our NaN column
dummy_data.drop(list(dummy_data.columns)[:-1],axis=1, inplace = True)
dummy_data.head()
# Appending nan column to our original calendar dataframe
df_calendar_merged = calendar_df.merge(dummy_data, how='outer', left_index=True, right_index=True)
# Checking we correctly merged the data
df_calendar_merged.head()
# Had issues accessing the nan column, so decided to rename it manually
some = list(df_calendar_merged.columns)
some[4] = 'NaN'
# Checking our list is correct
some
# Reassigning our column names
df_calendar_merged.columns = some
# Checking our column names came out right
df_calendar_merged.columns
# Checking our datatypes worked
df_calendar_merged.dtypes
# converting our NaN column to an integer column
df_calendar_merged.NaN = df_calendar_merged['NaN'].astype('int64', copy=False)
# Checking we correctly converted our columns
df_calendar_merged.dtypes
# Removing the $ and , signs (regex=False so '$' is treated as a literal character)
df_calendar_merged.price = df_calendar_merged.price.str.replace('$', '', regex=False)
df_calendar_merged.price = df_calendar_merged.price.str.replace(',', '', regex=False)
# Checking the above worked
df_calendar_merged.price.head()
# Converting our price column to a float column
df_calendar_merged.price = df_calendar_merged['price'].astype('float64', copy=False)
# Checking the conversion worked
df_calendar_merged.dtypes
# Imputing the mean
df_calendar_merged['price'].fillna(df_calendar_merged['price'].mean(),inplace=True)
# Ensuring the mean was imputed
df_calendar_merged['price'].head()
# Converting the date to datetime
df_calendar_merged['date'] = pd.to_datetime(df_calendar_merged['date'], infer_datetime_format= True)
# Checking the above worked
df_calendar_merged.dtypes
# Double check to ensure we dont have any NaN's in our available column
df_calendar_merged.available.value_counts(dropna=False)
# Creating a dummy dataframe
dummy_avail = pd.get_dummies(df_calendar_merged['available'])
dummy_avail.head()
# Appending avail columns to our original calendar dataframe
df_calendar_merged = df_calendar_merged.merge(dummy_avail, how='outer', left_index=True, right_index=True)
# Checking the merge
df_calendar_merged.head()
# Dropping our available column
df_calendar_merged.drop(['available'],axis=1,inplace=True)
# Checking the drop worked
df_calendar_merged.head()
# Renaming for neatness
df_calendar_merged.rename(index=str, columns={"f": "available_f", "t": "available_t"},inplace=True)
# checking the new names
df_calendar_merged.head()
# converting our avail columns to an integer column
df_calendar_merged.available_f = df_calendar_merged['available_f'].astype('int64', copy=False)
df_calendar_merged.available_t = df_calendar_merged['available_t'].astype('int64', copy=False)
# checking the new names
df_calendar_merged.dtypes
# Final checks before we conduct any merges
review_new.head()
# Final checks before we conduct any merges
listing_new.head()
# a quick second look at the number of rows before we merge
print ("We have {} rows in our calendar dataset and {} rows in our listing dataset".format(df_calendar_merged.shape[0],listing_new.shape[0]))
# Merging our listing_new and df_calendar_merged dataframes!!!
final_df = df_calendar_merged.merge(listing_new, how='outer', left_on='listing_id', right_on='id')
# Making sure the join didn't introduce NaNs
final_df.isna().sum()
# checking our dtypes are all gucci
final_df.dtypes
WOW. That was a lot. But hey, we did it, right?.. right?
Moving on, let's take a look at our actual questions. We should be in a good position to analyze our dataset and try to evaluate these questions. I'm going to separate them into their own individual parts so that we can tackle each question on its own. Below we can review the questions:
So, let's begin with question 1.
So, how often are residences typically occupied? How does this vary with the price? Let's break this down.
So, we now have an approach for this question. Let's try to execute.
# Looking at the columns we need
list(final_df.columns)
Hmm, since we don't have explicit information on how long each listing is occupied per stay, we may be able to substitute "reviews per month", assuming a constant review rate across all houses. Reviews per month then give us a utilization rate for each property per month.
Assumptions:
The reason I didn't go with the available columns is that a residence may be unavailable for reasons other than being occupied... oooh, look at that domain interpretation.
# Looking at the time the data set spans over
print ("We begin this data set at {} and end at {}".format(min(final_df.date),max(final_df.date)))
# Viewing our reviews column
final_df.reviews_per_month.describe()
# Viewing our reviews per month with average measures
plt.figure(figsize=(10, 10), dpi= 50, facecolor='w', edgecolor='k')
ax = sns.distplot(final_df.reviews_per_month, bins=50, kde=True, color="g")
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(25)
plt.axvline(final_df.reviews_per_month.median(),color='r', label = "Median")
plt.axvline(final_df.reviews_per_month.quantile(0.25), label = "25th Percentile")
plt.axvline(final_df.reviews_per_month.quantile(0.75), label = "75th Percentile", color='k')
plt.axvline(final_df.reviews_per_month.mean(), color='y', label ="Mean")
plt.legend(fontsize=20)
plt.xlim(0,13)
plt.ylim(0,1)
plt.xlabel('Occupation Rate (Per Month)', size=30)
plt.title("Histogram of reviews per month", size=30);
We seem to have two peaks within this dataset. This may identify two potential clusters within our data if we considered a Gaussian mixture model, but that would come much later. For now, we can identify that the dataset is clearly skewed to the right (even if there isn't a large gap between our mean and median) and that it clearly follows a non-normal distribution.
print ("The IQR of our occupation rate is {}".format(iqr(final_df.reviews_per_month)))
So, to answer the first part of our question: residences in Seattle are typically occupied between 1 and 3 times per month. We do see values close to or at 0, since our reviews_per_month column seems to have been calculated by aggregating down from year to month (divided by 12), as having fewer than 1 review per month otherwise makes no sense.
# Creating a scatter plot to view any potential relationship
plt.figure(figsize=(10, 10), dpi= 40, facecolor='w', edgecolor='k')
ax = sns.regplot(y="price", x="reviews_per_month", fit_reg=True ,data=final_df,
scatter_kws={"s": 3, "alpha": 0.003, "color": "red"})
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(25)
plt.title('Relationship of Price and Occupation Rate', size = 30)
plt.ylabel('Price ($)', size = 30)
plt.xlabel('Occupation Rate (Per Month)', size = 30);
It is rather difficult to see much of a relationship across this plot. While the trendline seems to be negative, we may benefit from aggregating our data into "buckets" and analyzing again.
Note: It is difficult to interpret our data due to the noise that is clearly prevalent.
# Compiling a simple function to bucket our reviews into a new variable
# (using an elif chain so values don't fall through to the final else by accident)
def review_buck(x):
    if x < 2:
        return 1
    elif x < 4:
        return 3
    elif x < 6:
        return 5
    elif x < 8:
        return 7
    elif x < 10:
        return 9
    elif x < 12:
        return 11
    else:
        return 13
# Creating our new variable
final_df['review_bucket'] = final_df.reviews_per_month.apply(review_buck)
# Looking at where most of our data lies
final_df.review_bucket.value_counts()
# Creating a smoother visual for price and occupancy rate
plt.figure(figsize=(15, 15), dpi= 40, facecolor='w', edgecolor='k')
ax = sns.barplot(x="review_bucket", y="price", data=final_df)
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
ax.get_xticklabels() + ax.get_yticklabels()):
item.set_fontsize(25)
plt.ylabel('Mean Price ($)', fontsize = 25)
plt.xlabel('Occupation Rate (Per Month)' , fontsize = 25);
We can see a stark disparity between the price of frequently occupied (>11) and rarely occupied (<3) listings.
Note: It is important to note we have a lack of data in the higher occupation rate buckets which makes it a difficult comparison to interpret.
# Calculating the Pearson correlation coefficient between our two variables
print("The Pearson r correlation coefficient between price and occupation rate is {0:.4f}.".format(pearsonr(final_df.reviews_per_month,final_df.price)[0]))
print("The p value for this is {}.".format(pearsonr(final_df.reviews_per_month,final_df.price)[1]))
So, which are the key relationships that help to predict our review scores?
The approach for this question is rather straightforward. We can create a clean visual by displaying a seaborn heatmap to try and pick out the clear variables affecting our review scores.
# Creating a correlation matrix to identify any potential relationships
corr = final_df.drop('date', axis=1).corr()
plt.figure(figsize=(30, 30), dpi= 40, facecolor='w', edgecolor='k')
ax = sns.heatmap(corr,annot=True, cmap="YlGnBu",
xticklabels=corr.columns.values,
yticklabels=corr.columns.values)
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(40);
So, did we answer our question: how often are residences typically occupied, and how does this vary with the price?
Now, let us focus on Which are the key relationships that help to predict our review scores?.
So, it is clear that there is a high degree of collinearity between all the review scores, which makes sense: if the overall experience was positive then most of the review scores would be positive, and vice versa.
Apart from the other review scores, which seem to be clustered together in our heatmap, there are no real key predictors that we can identify for our review scores.
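When a heatmap this size gets hard to read, it can help to rank features by their absolute correlation with the target directly. A minimal sketch on a toy frame (the column names and values here are made up for illustration, not taken from our dataset):

```python
import pandas as pd
import numpy as np

# Toy frame: 'review_score' is built to depend strongly on 'accommodates' only
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({'price': rng.normal(100, 20, n),
                    'accommodates': rng.normal(3, 1, n)})
toy['review_score'] = 5 * toy['accommodates'] + rng.normal(0, 1, n)

# Rank every other column by |correlation| with the target
ranking = (toy.corr()['review_score']
              .drop('review_score')
              .abs()
              .sort_values(ascending=False))
```

Applied to `final_df` (after dropping the sibling review-score columns), the same one-liner would surface whichever predictors exist, or confirm quantitatively that none do.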
So, can we predict if a house will receive a 'good' score? Let's break this down.
So, we now have an approach for this question. Let's try to execute.
# Let's begin by creating a new dataframe without certain unnecessary columns
df = final_df.drop(['date','review_bucket','listing_id'],axis=1)
# Checking we correctly dropped the columns in our new dataframe
list(df.columns)
# Creating an aggregated target variable
df['review_score'] = (df['review_scores_rating']/10+df['review_scores_accuracy']+df['review_scores_cleanliness']+
df['review_scores_checkin']+df['review_scores_communication']+
df['review_scores_location']+df['review_scores_value'])/7
df.head()
# Understanding our distribution of scores before setting a "bad score"
df.review_score.hist()
plt.axvline(df.review_score.median(),color='r', label = "Median")
plt.axvline(df.review_score.mean(),color='g', label = "Mean")
plt.legend(fontsize=20);
# Looking at the values present in our review scores
df.review_score.value_counts()
# In the case of supervised models, we're going to want our score to be our target variable
# We also want to split our data into training and testing sets
df_train = df.drop([ 'review_scores_rating',
'review_scores_accuracy',
'review_scores_cleanliness',
'review_scores_checkin',
'review_scores_communication',
'review_scores_location',
'review_scores_value',
'review_score'], axis=1)
df_score = df['review_score']
# Let's transform our review_score so that below the mean represents a "bad" score, and above the mean a "good" score
mean = df.review_score.mean()
# Encoding the review scores to binary values (1 = above the mean, 0 = at or below it)
score = df_score.apply(lambda x: 1 if x > mean else 0)
# Split the features and target data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_train,
score,
test_size = 0.2,
random_state = 0)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Below, I am going to try several different classifiers and evaluate them to see how they perform.
# Running an AdaBoost classifier on our data
ada_mod = AdaBoostClassifier(n_estimators=100, learning_rate=0.2)
# Fit our model to the training data
ada_mod.fit(X_train, y_train)
# Predict on the test data
predictions = ada_mod.predict(X_test)
# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, predictions.round())))
print('Precision score: ', format(precision_score(y_test, predictions.round())))
print('Recall score: ', format(recall_score(y_test, predictions.round())))
print('F1 score: ', format(f1_score(y_test, predictions.round())))
Hmm. We could tune our model's parameters, but I feel a different model would perform better.
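For reference, if we did want to tune the AdaBoost model, a grid search along these lines would be the usual route. This sketch runs on synthetic data (re-running on the full calendar-level frame is slow), and the grid values are just illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for our feature matrix and good/bad score labels
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid over the two knobs we set by hand earlier
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.2, 1.0]}
search = GridSearchCV(AdaBoostClassifier(random_state=0),
                      param_grid, scoring='f1', cv=3)
search.fit(X_tr, y_tr)
```

Swapping `X_tr, y_tr` for our `X_train, y_train` would tune the real model; `search.best_params_` and `search.best_estimator_` then hold the winning configuration.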
# Define the classifier, and fit it to the data
modelTree = DecisionTreeClassifier()
modelTree.fit(X_train, y_train)
# Making predictions
y_test_pred = modelTree.predict(X_test)
# Calculate the accuracy
print('Accuracy score: ', format(accuracy_score(y_test, y_test_pred)))
print('Precision score: ', format(precision_score(y_test, y_test_pred)))
print('Recall score: ', format(recall_score(y_test, y_test_pred)))
print('F1 score: ', format(f1_score(y_test, y_test_pred)))
Wow, seems like we have a winner. Can't beat those scores. One caveat, though: since each listing repeats across many calendar rows with identical listing-level features, near-duplicate rows land in both the training and testing sets, so these scores are almost certainly inflated.
Let's see if we can visualise this below.
# Visualising the decision tree model
dot_data = StringIO()
export_graphviz(modelTree, out_file=dot_data,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())